08/15/2017

Get Contributor stats from git

TL;DR

Abstract: This article is about getting contributor stats from a git repository.

Solution:

  1. To get the number of commits for each user, execute: git shortlog -sn --all
  2. To get the number of lines added and deleted by a specific user, install q and then execute: git log --author="authorsname" --format=tformat: --numstat | q -t "select sum(c1), sum(c2) from -"

Conclusion:

  1. q is cool, put it in your toolbelt.
  2. Don't use those stats as a base for calculating salaries.

Overview

Last week I wanted to retrieve simple stats about contributors from a git repository.

I came up with the following two stats:

  1. Commits per Contributor
  2. Lines changed per Contributor

Stats from git

Commits per Contributor

Getting the number of commits for each contributor is easy with git. Just execute:

git shortlog -sn --all

and you will get an output like this:

  3  author

Each line shows the number of commits for one user.

The command is broken up as follows:

  • git shortlog summarizes git log
  • -s suppresses the description of the commits and shows only the commit count
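  • -n sorts the output by number of commits per author in descending order, instead of alphabetically by author name
  • --all includes the commits of all refs (every branch, plus tags and remote-tracking branches), not only the current branch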
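
To try the command without touching a real project, here is a minimal sketch that builds a throwaway repository and runs it there. The function name demo_shortlog, the helper commit_as, and the authors alice and bob are made up for illustration; only git shortlog -sn --all comes from the text above.

```shell
# Sketch: count commits per author in a scratch repository.
# demo_shortlog, commit_as, alice and bob are hypothetical names.
demo_shortlog() {
  repo=$(mktemp -d) || return 1
  (
    cd "$repo" || exit 1
    git init -q .
    # Create one empty commit attributed to the given author name.
    commit_as() {
      git -c user.name="$1" -c user.email="$1@example.com" \
        commit -q --allow-empty -m "$2"
    }
    commit_as alice "first commit"
    commit_as alice "second commit"
    commit_as bob "third commit"
    # One line per author: commit count, a tab, then the author name.
    git shortlog -sn --all
  )
  rm -rf "$repo"
}
```

Calling demo_shortlog should list alice before bob, because -n puts the author with the most commits first.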

Lines changed per Contributor

The second thing I was interested in was the number of lines changed.

Git is able to tell you the number of lines changed per file for each commit. If we restrict the shown commits to those of a specific author, we get a list of all the changes he or she made, file by file.

This can be accomplished with the following git command: git log --author="authorsname" --format=tformat: --numstat

The command can be broken down like so:

  • git log shows info about commits
  • --author="name" shows only commits whose author matches the given name. You could also use --committer="name" if your author and committer are always the same.
  • --format=tformat: is a nice one: the empty tformat string gets rid of all the usual commit information, so git prints an empty string for each commit.
  • --numstat adds the number of lines added and deleted for each file of the commit.
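
As a rough sketch of what this command prints, the hypothetical helper below wraps it in a function (author_numstat is a made-up name). Each non-empty output line has three tab-separated fields: lines added, lines deleted, and the file path. For binary files git prints - in place of both counts, which is worth keeping in mind before summing the columns.

```shell
# Sketch: print the raw numstat rows for one author in the current repository.
# author_numstat is a hypothetical helper name; $1 is the author to match.
author_numstat() {
  git log --author="$1" --format=tformat: --numstat
}
```

Inside a repository, author_numstat "authorsname" | head shows the first few rows.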

This leaves us with one remaining problem: how do we sum up all those individual lines if we don't care about the file names? We will need another tool for that. The question is, which one?

I had the following ideas:

  1. Write a custom shell script by gluing together different command-line tools
  2. Write a custom command-line tool
  3. Use Excel

So I looked at all of them:

  1. A shell script would need to use additional tools and would probably be pretty brittle - it would break on different versions of the tools used, depend on a specific shell and have very slim chances of working cross platform. If I am forced to write something new, I'd rather write a small command-line tool.
  2. Nah, on second thought, I don't want to do that either. I was looking for something more out of the box for such a generic task.
  3. Are you kidding me, we are at the commandline here?!

Then it dawned on me: What I wanted to do was calculate aggregates over columns - and there is already a great way to express that: SQL!

The question is, how can I run a SQL statement against that output without the overhead of pushing it into a relational database first?

This is where the great little command-line tool q (https://github.com/harelba/q) comes in!

It can run a SQL query against data coming from CSV files and from STDIN. The command that I used (including the SQL query that calculates the sums of all added and deleted lines) for q is q -t "select sum(c1), sum(c2) from -".

The command can be broken down as follows:

  • -t uses tab as the separator between columns
  • sum(c1) the sum of the first column
  • sum(c2) the sum of the second column
  • from - from STDIN

Now we only need to pipe the output of the git command into our call of q.

Therefore to get the number of lines changed per contributor you need to:

  1. Install q as explained on its project page (https://github.com/harelba/q).
  2. Execute:

    git log --author="authorsname" --format=tformat: --numstat | q -t "select sum(c1), sum(c2) from -"
  3. And you will get an output like this:

    4       1

    which is the number of added and deleted lines.
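
If q is not available, the same two sums can be approximated with awk. This is a swapped-in technique, not the approach this article recommends, and sum_numstat is a hypothetical function name. It reads numstat rows from STDIN, skips blank lines and binary files (which git reports as -), and prints the added and deleted totals separated by a tab.

```shell
# Sketch: sum the added and deleted columns of `git log --numstat` output.
# sum_numstat is a hypothetical name; this uses awk instead of q.
sum_numstat() {
  awk -F '\t' '
    # Keep only rows whose first two fields are plain numbers;
    # this skips empty lines and the "-" markers of binary files.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { added += $1; deleted += $2 }
    END { printf "%d\t%d\n", added, deleted }
  '
}
```

It would be used in place of the q call: git log --author="authorsname" --format=tformat: --numstat | sum_numstat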

Svn

If you are working with svn, you could try the git svn bridge and clone a repo from svn first, like so: git svn clone url_to_your_repository

Conclusion

While getting stats like commits per contributor and lines changed is rather easy using git, those stats may not be as useful as you think for measuring developer contribution:

  1. They are not comparable in and of themselves. Comparing the number of added F# lines with the number of lines added to an XML config file is maybe not the best idea.

  2. They can easily be gamed, for example by adding and removing extra lines, or by splitting changes over multiple commits and multiple lines - you will get what you measure.

  3. Last but not least: they almost always measure the wrong thing. Nobody cares about lines and commits - in my opinion, code coverage, issues solved, good communication, onboarding new team members and so on are far more important.

So take stats like this with a grain of salt, or better a spoonful, and don't base any important decisions upon them!

On one hand, maybe those stats aren't so great after all. On their own, they seem more like vanity metrics than actionable data. But on the other hand, I've put a new and flexible tool into my toolbelt today :)

Last updated 08/15/2017 17:51:53